Pivoting and Joining

September 10 + 12, 2024

Jo Hardin

Agenda 2/10/25

  1. Pivoting
  2. Joining

pivoting

image credit: https://www.garrickadenbuie.com/project/tidyexplain/

From wide to long and long to wide

  • pivot_longer() makes the data frame “longer” – many columns into a few columns (more rows):

pivot_longer(data,cols,names_to=,value_to=)

  • pivot_wider() makes the data frame “wider” – a few columns into many columns (fewer rows):

pivot_wider(data,names_from=,values_from=)

From wide to long and long to wide

pivot_longer

pivot_longer will be demonstrated using datasets from GapMinder.

litF represents country, year, and female literacy rate.

library(googlesheets4)
gs4_deauth()
litF <- read_sheet("https://docs.google.com/spreadsheets/d/1hDinTIRHQIaZg1RUn6Z_6mo12PtKwEPFIz_mJVF6P5I/pub?gid=0")

litF

pivot_longer

litF <- litF |> 
  select(country = starts_with("Adult"), everything()) |> 
  pivot_longer(cols = -country,
               names_to = "year", 
               values_to = "litRateF") |> 
  filter(!is.na(litRateF))

litF

pivot_longer

GDP represents country, year, and gdp (in fixed 2000 US$).

GDP <- read_sheet("https://docs.google.com/spreadsheets/d/1RctTQmKB0hzbm1E8rGcufYdMshRdhmYdeL29nXqmvsc/pub?gid=0")

GDP <- GDP |>  
  select(country = starts_with("Income"), everything()) |> 
  pivot_longer(cols = -country, 
               names_to = "year",
               values_to = "gdp") |> 
  filter(!is.na(gdp))

GDP

pivot_wider

pivot_wider will be demonstrated using the babynames dataset.

library(babynames)

babynames

pivot_wider babynames

babynames |>  
  select(-prop) |> 
  pivot_wider(names_from = sex, values_from = n) 

pivot_wider babynames

babynames |>  
  select(-prop) |>  
  pivot_wider(names_from = sex, values_from = n) |> 
  filter(!is.na(F), !is.na(M)) |> 
  arrange(desc(year), desc(F))

pivot_wider babynames

babynames |>  
  pivot_wider(names_from = sex, values_from = n) |> 
  mutate(maxcount = pmax(F, M, na.rm = TRUE)) |> 
  arrange(desc(maxcount))

Agenda 2/12/25

  1. Joining

_joining

Relational data (multiple data frames)

See the Posit cheatsheets on wrangling & joining and pivoting.

Joining two (or more) data frames:

  • left_join returns all rows from the left table, and any rows with matching keys from the right table.
  • inner_join returns only the rows in which the left table have matching keys in the right table (i.e., matching rows in both sets).
  • full_join returns all rows from both tables, join records from the left which have matching keys in the right table.

Good practice: always specify the by argument when joining data frames.

Women in Science

10 women in science who changed the world (source: Discover Magazine)1

name profession
Ada Lovelace Mathematician
Marie Curie Physicist and Chemist
Janaki Ammal Botanist
Chien-Shiung Wu Physicist
Katherine Johnson Mathematician
Rosalind Franklin Chemist
Vera Rubin Astronomer
Gladys West Mathematician
Flossie Wong-Staal Virologist and Molecular Biologist
Jennifer Doudna Biochemist

Inputs

professions
dates
works

Desired output

We’d like to put together the data to look like:

Inputs, reminder

nrow(professions)
[1] 10
nrow(dates)
[1] 8
nrow(works)
[1] 9
names(professions)
[1] "name"       "profession"
names(dates)
[1] "name"       "birth_year" "death_year"
names(works)
[1] "name"      "known_for"

Setup

For the next few slides…

x
y

left_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
left_join(x, y, by = "id")

left_join()

professions |> 
  left_join(dates, by = "name") 

right_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
right_join(x, y, by = "id")

right_join()

professions |> 
  right_join(dates, by = "name") 

full_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
full_join(x, y, by = "id")

full_join()

dates |> 
  full_join(works, by = "name") 

inner_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
inner_join(x, y, by = "id")

inner_join()

dates |> 
  inner_join(works, by = "name") 

semi_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
semi_join(x, y, by = "id")

semi_join()

dates |> 
  semi_join(works, by = "name") 

anti_join()

image credit: https://www.garrickadenbuie.com/project/tidyexplain/
anti_join(x, y, by = "id")

anti_join()

dates |> 
  anti_join(works, by = "name") 

Putting it all together

professions |> 
  left_join(dates, by = "name") |> 
  left_join(works, by = "name")

Practice litF and GDP from Gapminder.

left

litGDPleft <- left_join(litF, GDP, by=c("country", "year"))
dim(litGDPleft)
[1] 571   4
litGDPleft

inner

litGDPinner <- inner_join(litF, GDP, by=c("country", "year"))
dim(litGDPinner)
[1] 505   4
litGDPinner

full

litGDPfull <- full_join(litF, GDP, by=c("country", "year"))
dim(litGDPfull)
[1] 8054    4
litGDPfull

join to merge two datasets

If you ever need to understand which join is the right join for you, try to find an image that will lay out what the function is doing. I found this one that is quite good and is taken from the Statistics Globe blog.